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1 . INTRODUCTION 



This paper reports a numerical experiment with an approach to inversion 
of a real matrix using a block partitioning structure. The study arises in 
the context of design of a large scale mathematical programming system for 
use within various computer environments. The scheme permits controlled 
explicit pagination of mathematical operations to coincide with boundaries 
specified by hardware memory management. The timing results presented in- 
dicate that maximizing work-per-page does not necessarily minimize total 
execution time as folklore would advise. Further, performance of an 
inversion scheme such as this is not always adequately estimated by 
classical means. The implications even for the simple cases reported here 
are potentially of importance to our further design work. 

Obtaining the explicit numerical inverse of a real matrix presents a 
classic problem in numerical methods which has received intense study in the 
literature; the wide spectrum of applications requiring inversion of matrices 
bears testimony to its fundamental nature. Numerous algorithms for matrix 
inversion have been presented and analyzed for error and speed [8,12,31] as 
have methods for error reduction by pivot selection and scaling [1,9], 
iterative improvement of accuracy [33], and exploitation of special features 
in the matrix to be inverted, especially sparseness [7,22,23,24,28,30,34,35], 
symmetry and band structure [26]. Studies of the form and complexity of the 
inversion process have included graph theoretic descriptions [4,28], statis- 
tical characterizations [19], exploitation of algorithmic parallelism [20], 
exercising special features of multiprogramming environments [17], and so 
forth . 



One of the active application areas for numerical matrix inversion has 
been in large scale mathematical programming. Most successful codes for 
large problems must resort to some form of inversion (and often, reinversion) 
technique based either on special structure in the problem recognized a 
priori, or on special storage mechanisms for the inverse matrix. The moti- 
vation for these efforts is that the available main memory on modern digital 
computers is not sufficiently large to store the inverse explicitly for 
problems of contemporary scale (say, thousands of rows). 

Unfortunately, many methods based on a priori structure in problems are 
inherently ad hoc in nature - decomposition methods [2,6,11,25] are con- 
sidered by many to exemplify this shortcoming. On the other hand, for some 
important classes of problems sharing a common special structure, new tech- 
niques developed in concert with new data structures for problem represen- 
tation have led to remarkable breakthroughs. In fact, sometimes a 
triangulated basis is superior to an explicit inverse. As an example, a 
primal (simplex) network code [3] has triangulated node-arc incident matrices 
of rank 10,000 in 20 seconds (IBM 360/67) with no rounding error. Other 
successful special methods have included element generation by generalized 
upper bounding [5], factorization [15], and other compact working inverse 
basis methods [14]. 

Special storage mechanisms for the inverse have typically included 
columnwise product form representation (with "ETA" columns) [18] and modifi- 
cations attempting to maintain sparcity in the inverse [13]. Also, inversion 
and manipulation of sparce matrices is possible based on other structural 
decompositions, including doubly linked lists, bit maps, etc. [22,29]. 

The introduction of computers with memory organized in pages which are 
transferred in bulk swapping transactions between high speed magnetic devices 
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and main memory has led many to conclude that the vastly increased virtual 
memory capacity will allow design of large scale algorithms ignorant of 
main, or cache, memory configurations. Unfortunately, algorithms designed 
originally for conventional systems have performed very poorly in the paged 
environment. The study of dynamic program behavior within paging systems 
indicates that the global cost of page swaps is far from negligible. Thus, 
with a given amount of work to perform in inverting a matrix, design 
criteria now include consideration of both minimizing page swaps and 
maximizing the work performed on each page during its cache residence [16, 

17]. 

This paper presents an approach to inverting a real matrix using a 
partitioning structure and block pivots ("B pivots") which perform bulk 
inversions of interior submatrices [21,27]. The technique is demonstrated 
to have theoretical computational efficiency in terms of long (floating 
point) operations equal to classical methods, and parametric high resolution 
timing of a FORTRAN routine executed on a dedicated processor with blocks of 
varied size is given, with some discussion of surprising behavior. 

This scheme employs explicit pagination of the mathematical operations 
in inversion and is potentially useful for large scale mathematical programming. 
The reason that the empirical study concentrates on a dedicated processor is 
two-fold. For contemporary large scale programming applications, the wall 
clock time, rather than central processor active compute time, often dominates 
the attention of the user - these applications are capable of laying seige to 
the entire computer system, whether paged or not. Second, we are interested 
in the performance characteristics of various numerical techniques imbedded 
in a large scale mathematical programming code now under development. The 
design of the new system, based on parametric studies such as this, includes 



3 



the provision for robust performance within varied computer environments. 

Similarly, the examples are chosen to be full rank and dense in the 
interests of emphasizing the execution performance of the partitioned 
algorithm, rather than that of exercising some supporting storage structure, 
list mechanism, or other feature not under study here. Even for large, 
sparce problems, however, nonzero elements can often be aggregated to dense 
blocks over subsets of the matrix rows and columns. 

The advent of minicomputers, micro computers, and various multiple 
processor configurations portends further application of B-pivot inversion 
even for relatively small or intermediate scale problems. 
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2. INVERSION TRANSFORMATION 



The classical inversion method used here is in-situ Gauss-Jordan 
pivoting, chosen for speed, generality, and storage economy. To illustrate, 
let A be the nonsingular real square matrix of rank n to be inverted, 
with a^. a general element of A. An in-situ scaler pivot on a^ is 
defined as follows for a^ t 0 with indicating elemental replace- 



ment: 
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An inverse is developed in place by performing n sequential pivotal 
transformations of this type on the main diagonal elements a , p=l,..,n. 

r r 

However, since the accuracy of the computations can be adversely affected by 
the relative magnitude of the elements in A , and since the transformation 
is not defined for a^ = 0 , a pivot selection strategy is usually employed 
in inversion. 

Notably, "partial" pivots use sequential rows for k and in each select 
i by finding the largest absolute element in a previously unused column. 
"Full" pivots select the largest absolute matrix element remaining in an 
unused row and column. Often equilibration, or scaling, of the matrix columns 
for partial pivoting, and of rows and columns for full pivoting can further 
increase accuracy. 
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Both partial and full pivot strategies require that the inverse be 
recovered by recording the coordinates for each pivot and using this history 
to permute rows and columns of the result matrix left by pivoting. If 
r = k and c = l for pivot p, p=l,..,n , then inverse row c. is 

r H 

located in matrix row r. , and inverse column r. is located in matrix 

1 J 

column c-. Thus, the increased accuracy of these pivot selection schemes 

J 

comes at the cost of extra scanning of elements in pivot selection tourna- 
ments, bookkeeping and in reordering the resulting inverse. 

A simple computation timing estimate for Gauss-Jordan pivoting can be 
arrived at by concentrating only on the floating point arithmetic required 
by (1).[^3] Assuming an efficient program utilizing partial results in the 

pivot rows, or columns, in computing the general elements, a total of n 

2 

pivots will be performed, each requiring 1 + 2 ( n - 1) multiplications and 
2 

(n - 1) additions. Aggregating operations for each pivot gives the total 

W( n ) = 2n 2 - 2n + 1 , (2) 

and a complete inversion time of n W( n ). For simplicity the effect of 
pivot selection strategy and other program overhead has been ignored. 

Now let us introduce B-pivots to the inversion by partitioning A into 
submatrices as follows: 

(3) 
b 

n 

If A-|i is b x b , an in-si tu B-pivot on A^ is defined as follows 
for nonsingular A^ : 

An An" 1 (B-pivot block) (4) 



*21 |*22 
b 
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(B-pivot row) 



A 1 2 ^ A 11 A 12 
A 21 ^ “ A 21 A ll _1 (B-pivot column) 

A 22 ^ A 22 " A 21 A ll -1 A 12 (9 enera1 elements) 

Since is found by using the Gauss-Jordan transformations in the 

first b rows and columns, it is clear that if we continue with another 
B-pivot in the next n - b rows and columns of A ^ have 

been formed. The resulting inverse in terms of the original elements in 
(3) is: 



A ’ 1 = 


A 11 1 + A 11 1 A 12 B A 21 


! 

A n 1 | _A n 1 A i 2 b 
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» 



and may be verified by multiplication. 

This B-pivot process may be continued analytically to any number P of 
B-pivots of varying size b p , p=l,..,P, indeed, with b-| = ...=b p = 1, and 
P = n, or with b-j = n, and P = 1, the process is simply the Gauss-Jordan 
transformation. 

A computation timing estimate using the simple approach producing (2) 
for a B-pivot of size b will give 

b W(b) = 2b 3 - 2b 2 + b 

operations for the inversion of the pivot block in (4), for the B-pivot row 
b (n - b) (2b - 1) operations, an equal number in the B-pivot column, and 
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2 2 

(n - b) (2b - 1) + (n - b) floating point computations for the general 
elements. This again sssumes an efficient program which uses partial results 
in the B-pivot row, or column, to compute general elements. For the complete 
B-pivot we have the total 

b (2n 2 - 2n + 1) = bW( n ) . (5) 

Thus, the B-pivot approach nominally requires no more computational effort, 
stated in terms of floating point operations, than scaler pivots. 



3. APPLICATION SCHEME 



P;1 

Let b.| = b 2 = ‘** =b p-l ’ b p = n - I bp . This "fixed block" strategy 
requires nonsingular A-ji»...,A . Applying this method to test matrices of 

dimensionality n = 10(10)50 with bp = 1(1 )n in a FORTRAN test program 
produced the timing results shown in Figures 1 and 2 when run on a dedicated 
IBM 360/67. The times shown are for active computation with memory resident 
matrices and are precise to four significant digits. As indicated, both 
partial and full pivot selection strategies within pivot blocks were tested. 

The FORTRAN program specifies a matrix of appropriate dimensionality, 
the block size to be used, and then exercises subroutines to perform the 
block inversions, row block transformations, column block transformations, 
and general block transformations. The program is compiled and optimized 
for run time efficiency by the IBM FORTRAN (H) compiler. 

The timing prediction of (5) appears to hold at least as a polynomial 
form for the full block (b p = n) execution times in Figure 1. However, 
note the surprising gain in speed for intermediate sizes of b in Figure 2. 
This is partially due to the work avoided in restricting to the pivot blocks 
the range of scalar pivot selection and subsequent matrix permutations. 
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Conversely, for very small B-pivots and those requiring a last B-pivot much 
smaller than b, an "odd B-pivot", the housekeeping overhead of the program 
increases the execution time significantly. Thus, it may be advantageous to 
choose block sizes which partition the problem into blocks of homogeneous 
size, rather than to fill a page. 

For these examples, the classical assumption that long floating point 
operations required by an algorithm dominate performance on a computer appears 
to be contradicted. It is certainly true that the time required for floating 
point operations relative to others such as register loads, compares, and so 
forth, has decreased greatly from the days of software arithmetic subroutines 
to contemporary floating point hardware (such as on the IBM 360/67 used here). 

As a practical matter, resident memory for three blocks is required for 
each complete B-pivot as shown in (4). The program can easily be modified 
to emulate hardware delays caused by page swaps, or other interference by a 
paging mechanism. The times reported do not reflect such modifications. 

4. DISCUSSION 

For matrix inversion and other matrix arithmetic the "fixed block" 
approach allows b to be chosen to fit within a page, or cache area, and 
minimize boundary violations by the computation. However, the folklore 
which specifies that each page should be as nearly filled as possible is not 
necessarily true. The problem dimension (or other consideration) can in 
some cases dictate a less than full page scheme for fastest (or best) 
execution . 

It is important to note that the block approach generates, a priori, 
demands for the matrix blocks from the out-of -memory device. Thus, an 
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agenda of input/output operations can be used to achieve simultaneity with 
computation. This can be much faster than a memory page fault and interrupt 
mechanism. Even in traditional single user environments the block size may 
be chosen to balance access-transfer time for slow magnetic devices with 
computation time, minimizing retrieval overhead with little memory cost. 

Note from Figures 1 and 2 that even a relatively small block pivot requires 
sufficient computation time (e.g., more than one second for n = 40) to 
permit simultaneous access to a disc or drum device. 

The fixed blocks provide a convenient parcelling of matrix manipulation 
effort as well as storage, and permit for many applications a more convenient 
access structure than traditional column by column methods. This is useful 
for large mathematical programming problems with block angular, staircase, 
or other special block structure, [14]. These B-pivots on fixed blocks may 
also be used to equitably distribute computation among available parallel 
processors in sizeable bulk to permit relatively lengthy independent 
operation . 



5. CONCLUSION 

The usefulness and speed of B-pivot schemes come with one important 
disadvantage: the pivot blocks must be nonsingular mathematically and in 

the stricter numerical sense. Since scalar pivot selection is restricted 
to each pivot block, one should try to assemble the matrix A accordingly. 
For instance, in large linear programming problems a history of algorithm 

progress can be used to provide Ap^ . In fact, B-pivots may be 

considered in this context as an extension of the concept of product form 
inverse. The B-pivot method works well for such problems, providing a 
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convenient tool for fast inversion, or reinversion, of kernels embedded in 
technological coefficients. 

Other classes of matrices exhibit a natural diagonal dominance permitting 
effective B-pivot inversion. For instance, the noise covariance matrix has 
been proposed for such use in [21], which includes an extensive group 
theoretic characterization of admissabil ity. Also, algorithms have been 
proposed for rearranging matrices to block angular form [32]. 

The accuracy of B-pivots may be improved by computing inner products 
with extended precision and special ordered addition. In some special 
cases of A, such as zero-one or sparse conditions as 



A = 


A n 


' 0 
1 ” 




A 21 


l a 
1 rt 22 






_j 



significant effort can be avoided by the B-pivot approach coupled with appro- 
priate modifications of program logic. Several approaches to avoiding B-pivot 
singularity failures have been suggested, including "dynamic" blocks, "gang" 
blocks, pivot block selection procedure, and a matrix construction to be 
performed concurrently with a scaling algorithm as in [9]. Fortunately, 
none has been necessary in applications to date. 

Continuing research focuses on block triangulation and dynamic factor- 
ization schemes for large scale mathematical programming, combined with 
generalized upper bounding in a hybrid system with both explicit and logically 
generated elements. It is becoming increasingly clear from experiments such 
as this that the (logical) algorithm and data representation of the program, 
and the (physical) organization of computations performed on the host computer 
under operating system control interact in subtle ways to give aggregate per- 
formance which is often counter-intuitive and seldom improved by ignoring the 
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details of either. The effect can be so pronounced that we are reexamining 
several classical techniques in this new light. 
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Figure 1 - Execution times for maximal "fixed block" pivots b = 
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FIGURE 2 - Execution times for "fixed block" B-pivots 
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